Abstract

Single Index Models (SIMs) are simple yet flexible semi-parametric models for classification and regression. 
Response variables are modeled as a nonlinear, monotonic function of a linear combination of features. Estimation 
in this context requires learning both the feature weights, and the nonlinear function. While methods have been 
described to learn SIMs in the low dimensional regime, a method that can efficiently learn SIMs in high dimensions 
has not been forthcoming. We propose three variants of a computationally and statistically efficient algorithm for 
SIM inference in high dimensions. We establish excess risk bounds for the proposed algorithms and experimentally 
validate the advantages that our SIM learning methods provide relative to Generalized Linear Model (GLM) and low 
dimensional SIM based learning methods. 


1 Introduction 

High-dimensional learning is often tackled using generalized linear models, where we assume that a response variable $Y \in \mathbb{R}$ is related to a feature vector $X \in \mathbb{R}^d$ via

$$\mathbb{E}[Y \mid X = x] = g_*(w_*^\top x) \quad (1)$$

for some weight vector $w_* \in \mathbb{R}^d$ and some monotonic and smooth function $g_*$, called the transfer function. Typical examples of $g_*$ are the logit function and the probit function for classification, and the linear function for regression. While classical work on generalized linear models (GLMs) assumes $g_*$ is known, this potentially nonlinear function is often unknown and hence a major challenge in statistical inference.

The model in (1) with $g_*$ unknown is called a Single Index Model (SIM) and is a powerful semi-parametric generalization of a GLM. SIMs were first introduced in econometrics and statistics [?]. Recently, computationally and statistically efficient algorithms have been provided for learning SIMs [5, 6] in low-dimensional settings where the number of samples/observations $n$ is larger than the ambient dimension $d$. However, modern data analysis problems in machine learning, signal processing, and computational biology involve high dimensional datasets, where the number of parameters far exceeds the number of samples ($n \ll d$).

In this paper we consider the problem of learning SIMs, given labeled data, in the high-dimensional regime. We
provide algorithms that are both computationally and statistically efficient for learning SIMs in high-dimensions, and 
validate our methods on several high dimensional datasets. Our contributions in this paper can be summarized as 
follows: 
1. We propose a suite of algorithms to learn SIMs in high dimensions. Our simplest algorithm, called SILO (Single Index Lasso Optimization), is a simple, non-iterative method that estimates the vector $w_*$ and a monotonic, Lipschitz function $g_*$. iSILO and ciSILO are iterative variants of SILO that use different loss functions. While iSILO uses a squared loss function, ciSILO uses a calibrated loss function that adapts to the SIM from which our data is generated.

2. We provide excess risk bounds on the hypotheses returned by SILO, iSILO, and ciSILO.

3. We experimentally compare our algorithms with other methods used both for SIM learning and high dimensional 
parameter estimation on various real world high dimensional datasets. Our experimental results show superior 
performance of iSILO and ciSILO when compared to commonly used methods for high dimensional estimation. 

The rest of the paper is organized as follows: In Section 2, we formally set up the problem we wish to solve, and detail the proposed methods, SILO, iSILO, and ciSILO. In Section 3, we perform a theoretical analysis of SILO, iSILO, and ciSILO. We perform a thorough empirical evaluation on several datasets in Section 4, and conclude our paper in Section 5. Full proofs of our theoretical analysis are available in the appendix.

1.1 Related work 

High dimensional parameter estimation for GLMs has been widely studied, both from a theoretical and algorithmic point of view (see [?] and references therein). Learning SIMs is a harder problem and was first introduced in econometrics and statistics [?]. In [6] the authors proposed and analyzed the Isotron algorithm to learn SIMs in the low dimensional setting. Isotron uses perceptron type updates to learn $w_*$, along with application of the Pool Adjacent Violator (PAV) algorithm to learn $g_*$. This was improved in [5] where the authors proposed the Slisotron algorithm that combined perceptron updates to learn $w_*$ along with a Lipschitz PAV (LPAV) procedure to learn $g_*$. Both the Isotron and the Slisotron algorithms rely on performing perceptron updates. While the perceptron algorithm works for low-dimensional classification problems, to the best of our knowledge the performance of the perceptron algorithm has not been studied in high dimensions. Hence, it is not clear if the Isotron and the Slisotron algorithms designed for learning SIMs in low dimensions would work in the high dimensional setting.

Alquier and Biau [1] consider learning high dimensional single index models. The authors provide estimators of $g_*$ and $w_*$ using PAC-Bayesian analysis. However, the estimator relies on reversible jump MCMC, and it is seemingly hard to implement. Also, the MCMC step is slow to converge even for moderately sized problems. To the best of our knowledge, simple, practical algorithms with theoretical guarantees and good empirical performance for learning single index models in high dimensions are not available. Restricted versions of the SIM estimation problem have been considered in [?], where the authors are only interested in accurate parameter estimation and not prediction. Hence, in these works the proposed algorithms do not learn the transfer function.

The LPAV: Before we discuss algorithms for learning high dimensional SIMs, we discuss the LPAV algorithm proposed in [5], as an extension to the PAV method used in [6]. Given data $(p_1, y_1), \ldots, (p_n, y_n)$, where $p_1, \ldots, p_n \in \mathbb{R}$, the LPAV outputs the best univariate monotonic, 1-Lipschitz function $\hat g$ that minimizes the squared error $\sum_{i=1}^{n} (g(p_i) - y_i)^2$. In order to do this, the LPAV first solves the following optimization problem:

$$\hat z = \arg\min_{z \in \mathbb{R}^n} \|z - y\|_2^2 \quad \text{s.t.} \quad 0 \le z_j - z_i \le p_j - p_i \ \text{ if } p_i \le p_j \quad (2)$$

where $\hat g(p_i) = \hat z_i$. This gives us the value of $\hat g$ on a discrete set of points $p_1, \ldots, p_n$. To get $\hat g$ everywhere else on the real line, we simply perform linear interpolation as follows: Sort $p_i$ for all $i$ and let $p_{\{i\}}$ be the $i$-th entry after sorting, with $\hat z_{\{i\}}$ the corresponding fitted value. Then, for any $\zeta \in \mathbb{R}$, we have

$$\hat g(\zeta) = \begin{cases} \hat z_{\{1\}}, & \text{if } \zeta \le p_{\{1\}} \\ \hat z_{\{n\}}, & \text{if } \zeta \ge p_{\{n\}} \\ \mu \hat z_{\{i\}} + (1 - \mu) \hat z_{\{i+1\}}, & \text{if } \zeta = \mu p_{\{i\}} + (1 - \mu) p_{\{i+1\}} \end{cases} \quad (3)$$

In the algorithms that we shall discuss in this paper we shall invoke the LPAV routine with $p_i$ set to the projection of the data point $x_i$ on some algorithm-dependent weight vector $w$.
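As a rough illustration of the interpolation rule (3), the following Python sketch evaluates the fitted function at new points. It assumes the LPAV solution of problem (2) is already available and that the $p_i$ are distinct; the function name `interpolate_transfer` and the use of NumPy's `np.interp` are choices made for this sketch, not part of the method description.

```python
import numpy as np

def interpolate_transfer(p, z, zeta):
    """Evaluate the piecewise-linear extension of the fitted function at zeta.

    p    : projections p_i (assumed distinct), shape (n,)
    z    : fitted values z_i = g(p_i) from problem (2), shape (n,)
    zeta : query points, scalar or array
    """
    p = np.asarray(p, dtype=float)
    z = np.asarray(z, dtype=float)
    order = np.argsort(p)  # sort so that p_{1} <= ... <= p_{n}
    # np.interp interpolates linearly between knots and holds the end values
    # constant outside [p_{1}, p_{n}], which matches the three cases of (3).
    return np.interp(zeta, p[order], z[order])

# Hypothetical toy values, for illustration only.
p = [2.0, 0.0, 1.0]
z = [0.9, 0.1, 0.5]
values = interpolate_transfer(p, z, [-1.0, 0.5, 3.0])
```

Queries to the left of the smallest projection or to the right of the largest one receive the corresponding end value, so the extended function stays monotonic and 1-Lipschitz whenever the fitted values satisfy the constraints of (2).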
2 Statistical model and proposed algorithms

Assume we are provided i.i.d. data $\{(x_1, y_1), \ldots, (x_n, y_n)\}$, where the label $Y$ is generated according to the model $\mathbb{E}[Y \mid X = x] = g_*(w_*^\top x)$ for an unknown parameter vector $w_* \in \mathbb{R}^d$ with $n \ll d$, and an unknown 1-Lipschitz, monotonic function $g_*$. We additionally assume that $y \in [0, 1]$, $\|w_*\|_2 \le 1$ and $\|w_*\|_0 \le s$, where $\|\cdot\|_0$ is the $\ell_0$ pseudo-norm. The sparsity assumption on $w_*$ is motivated by the fact that consistent estimation in high dimensions is an ill-posed problem without making further structural assumptions on the underlying parameters.

Our goal is to make predictions on unseen data. Specifically, we would like to provide estimators $\hat g$ and $\hat w$ of $g_*$ and $w_*$ so that given a previously unseen sample $x$, we predict $\hat y = \hat g(\hat w^\top x)$. To this end, we propose three algorithms that we explain next.

2.1 SILO: Single Index Lasso Optimization 

We first propose SILO, a simple SIM learning algorithm that first learns $\hat w$ and then fits a function $\hat g$ using $\hat w$. Specifically, SILO performs the following two steps in a single pass:

1. In order to learn $\hat w$ we solve the problem that was first proposed in [10]. This optimization problem is independent of the transfer function $g_*$ and minimizes a linear loss subject to model constraints:

$$\hat w = \arg\min_{\|w\|_1 \le \sqrt{s},\ \|w\|_2 \le 1} \; -\frac{1}{n} \sum_{i=1}^{n} y_i x_i^\top w \quad (4)$$

where the constraint $\|w\|_1 \le \sqrt{s}$ arises from constraining an $s$-sparse vector to have unit Euclidean norm.

2. After learning $\hat w$, SILO simply fits a 1-Lipschitz monotonic function by invoking the LPAV routine with the vector $p = [p_1, \ldots, p_n]$, where $p_i = \hat w^\top x_i$. LPAV outputs a function $\hat g$. Our final predictor has the form $\hat y = \hat g(\hat w^\top x)$.

Note that there is no need to re-learn $\hat w$ after learning $\hat g$, since the optimization problem to learn $\hat w$ is independent of $g$. This property makes SILO a very simple and computationally attractive algorithm.
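The first step of SILO can be sketched in Python as follows. The sketch relies on an assumption not stated in the text: for a linear objective over the set $\{\|w\|_1 \le \sqrt{s}, \|w\|_2 \le 1\}$, the KKT conditions give a minimizer of the form $S_\tau(v) / \|S_\tau(v)\|_2$, where $v = \frac{1}{n} \sum_i y_i x_i$, $S_\tau$ is entrywise soft thresholding, and $\tau \ge 0$ is located by bisection. The names `soft_threshold` and `silo_direction`, the one-sample-per-row layout of `X`, and the requirement that $v \ne 0$ has a unique largest entry in absolute value are likewise assumptions of this sketch.

```python
import numpy as np

def soft_threshold(v, tau):
    """Entrywise soft thresholding S_tau(v)."""
    return np.sign(v) * np.maximum(np.abs(v) - tau, 0.0)

def silo_direction(X, y, s, iters=100):
    """Sketch of step 1 of SILO: maximize v^T w s.t. ||w||_1 <= sqrt(s), ||w||_2 <= 1.

    X : (n, d) array with one sample per row (layout assumed here)
    y : (n,) array of responses in [0, 1]
    s : sparsity level, s >= 1
    """
    v = X.T @ y / X.shape[0]           # v = (1/n) sum_i y_i x_i
    radius = np.sqrt(s)
    w = v / np.linalg.norm(v)
    if np.abs(w).sum() <= radius:      # l1 constraint inactive, no thresholding
        return w
    lo, hi = 0.0, np.abs(v).max()
    for _ in range(iters):             # bisection on the threshold tau
        tau = 0.5 * (lo + hi)
        u = soft_threshold(v, tau)
        norm = np.linalg.norm(u)
        if norm > 0.0 and np.abs(u).sum() > radius * norm:
            lo = tau                   # normalized vector still violates the l1 bound
        else:
            hi = tau                   # feasible: try a smaller threshold
    u = soft_threshold(v, hi)          # hi stays on the feasible side
    return u / np.linalg.norm(u)
```

The returned vector has unit Euclidean norm and an $\ell_1$ norm of at most $\sqrt{s}$, so it can be passed directly to the LPAV step. Any other solver for problem (4) could be substituted.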

2.2 iSILO: Iterative SILO with squared loss 

SILO is computationally very efficient, since it only involves learning $\hat w$, $\hat g$ once. However, completely ignoring $g$ to learn $w$ could be suboptimal, and we propose two algorithms to overcome this drawback. We first propose iSILO, an iterative method detailed in Algorithm 1. Given the model in (1), iSILO minimizes the squared loss with a sparsity penalty to estimate $w$, $g$:

$$\hat w, \hat g = \arg\min_{w, g} \frac{1}{n} \sum_{i=1}^{n} (y_i - g(w^\top x_i))^2 + \lambda \|w\|_1. \quad (5)$$


We adopt an alternating minimization procedure. In iteration $t$, given $g_{t-1}$, we would ideally perform a proximal point update w.r.t. $w$ to obtain

$$w_t = \mathrm{Prox}_{\lambda\eta\|\cdot\|_1}\left( w_{t-1} - \frac{\eta}{n} \sum_{i=1}^{n} (g_{t-1}(w_{t-1}^\top x_i) - y_i)\, g'_{t-1}(w_{t-1}^\top x_i)\, x_i \right)$$

where $\mathrm{Prox}(\cdot)$ is the soft thresholding operator associated with the $\|\cdot\|_1$ norm, $\eta > 0$ is an appropriate step size, and $g'_t$ is the derivative of $g_t$. Unfortunately, the above gradient step requires us to estimate the derivative of $g_t$, which can be difficult. So, instead of performing the above proximal gradient update, we perform a proximal perceptron type update similar in spirit to [5, 6], by replacing $g'_{t-1}$ by the Lipschitz constant of $g_{t-1}$. Since $g_{t-1}$ is obtained using
Algorithm 1: iSILO

Require: Data: $X = [x_1, \ldots, x_n]$, Labels: $y = [y_1, \ldots, y_n]^\top$, Regularization: $\lambda$, Step size: $\eta$, Initial parameters: $g_0$ is a 1-Lipschitz, monotonic function, $w_0 \in \mathbb{R}^d$, Iterations: $T > 0$.

1: Initialize $\hat w = w_0$, $\hat g = g_0$.
2: opterr = MSE($w_0$, $g_0$)
3: for $t = 1, \ldots, T$ do
4:    Perform the update shown in Equation (6) to get $w_t$.
5:    Calculate err = MSE($w_t$, $g_{t-1}$).
6:    if err < opterr then
7:       opterr = err.
8:       $\hat w = w_t$, $\hat g = g_{t-1}$
9:    end if
10:   Obtain $g_t$ by solving problem (2) with $p_i = w_t^\top x_i$, and linear interpolation (3).
11:   Calculate err = MSE($w_t$, $g_t$).
12:   if err < opterr then
13:      opterr = err.
14:      $\hat w = w_t$, $\hat g = g_t$
15:   end if
16: end for
17: Output $\hat w$, $\hat g$


the LPAV algorithm, $g_{t-1}$ is 1-Lipschitz. Note that unlike the perceptron, we have a non-unity step size. This leads to the following update equation

$$w_t = \mathrm{Prox}_{\lambda\eta\|\cdot\|_1}\left( w_{t-1} - \frac{\eta}{n} \sum_{i=1}^{n} (g_{t-1}(w_{t-1}^\top x_i) - y_i)\, x_i \right) \quad (6)$$

Given $w_t$ in iteration $t$, iSILO updates $g_t$ to be the solution to the LPAV problem with $p_i = w_t^\top x_i$.
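A minimal Python sketch of the update (6) is given below, assuming the previous transfer function estimate is available as a vectorized callable and that the data matrix stores one sample per row. The names `soft_threshold` and `isilo_step` are introduced here for illustration only.

```python
import numpy as np

def soft_threshold(v, tau):
    """Proximal operator of tau * ||.||_1 (entrywise soft thresholding)."""
    return np.sign(v) * np.maximum(np.abs(v) - tau, 0.0)

def isilo_step(w_prev, g_prev, X, y, lam, eta):
    """One proximal perceptron-type update, following Equation (6).

    w_prev : current weights, shape (d,)
    g_prev : callable applying the current transfer function entrywise
    X      : (n, d) data matrix, one sample per row (layout assumed here)
    y      : (n,) labels
    lam    : regularization parameter lambda
    eta    : step size
    """
    residual = g_prev(X @ w_prev) - y          # g_{t-1}(w^T x_i) - y_i
    direction = X.T @ residual / X.shape[0]    # (1/n) sum_i residual_i * x_i
    return soft_threshold(w_prev - eta * direction, lam * eta)

# Hypothetical toy inputs: identity transfer function, two samples.
X = np.array([[1.0, 0.0], [0.0, 1.0]])
y = np.array([1.0, 0.0])
w_next = isilo_step(np.zeros(2), lambda u: u, X, y, lam=0.1, eta=1.0)
```

In this toy example the unthresholded step moves the first coordinate to 0.5, and soft thresholding with level $\lambda\eta = 0.1$ shrinks it to 0.4 while the second coordinate stays at zero.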

The non-convexity of (5) requires us to perform a book-keeping procedure that keeps track of the best estimate of $\hat g$, $\hat w$ by calculating the MSE of the current hypothesis on a held-out validation set. This is done in steps 5-9 and 11-15 of Algorithm 1. Similar book-keeping procedures have been used in the Isotron and Slisotron algorithms of [5, 6].


2.3 ciSILO: Iterative SILO with calibrated loss 

iSILO, like the Slisotron algorithm [5], uses a squared loss function and an approximate gradient descent method to estimate $w_*$. These methods do not take into account the derivative of the estimate of the transfer function while taking gradient descent steps. We now propose ciSILO, a version of SILO that uses a calibrated loss function that adapts to the SIM that we are trying to learn.

Suppose $g_*$ was known. Let $\Phi_* : \mathbb{R} \to \mathbb{R}$ be a function such that $\Phi_*' = g_*$. Since $g_*$ is monotonically increasing, $\Phi_*$ is convex, and we can learn $w_*$ by solving the following convex program:


$$\hat w = \arg\min_{w} \frac{1}{n} \sum_{i=1}^{n} \left[ \Phi_*(w^\top x_i) - y_i w^\top x_i \right] + \lambda \|w\|_1 \quad (7)$$

When the transfer function is linear, $\Phi_*$ is a quadratic function, and we obtain the standard Lasso problem that minimizes squared loss with an $\ell_1$ penalty. When the transfer function is the logit function, (7) reduces to sparse logistic regression. Modulo the $\ell_1$ penalty term, the above objective is a sample version of the following stochastic optimization problem:

$$\min_{w} \mathbb{E}[\Phi_*(w^\top x) - y\, w^\top x]. \quad (8)$$
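For the logistic case mentioned above, the calibrated loss and its gradient can be written down explicitly. The Python sketch below takes the transfer function to be the sigmoid, so that its integral is the softplus function, and omits the $\ell_1$ penalty. The function names and the one-sample-per-row layout of `X` are assumptions made for illustration.

```python
import numpy as np

def sigmoid(u):
    """Transfer function g(u) = 1 / (1 + exp(-u))."""
    return 1.0 / (1.0 + np.exp(-u))

def softplus(u):
    """Phi(u) = log(1 + exp(u)), an antiderivative of the sigmoid."""
    return np.logaddexp(0.0, u)

def calibrated_loss(w, X, y):
    """Sample calibrated loss (1/n) sum_i [Phi(w^T x_i) - y_i w^T x_i]."""
    u = X @ w
    return np.mean(softplus(u) - y * u)

def calibrated_grad(w, X, y):
    """Gradient of the calibrated loss: (1/n) sum_i (g(w^T x_i) - y_i) x_i."""
    return X.T @ (sigmoid(X @ w) - y) / X.shape[0]
```

The gradient involves only the transfer function itself and not its derivative, which is the property that the calibrated loss is designed to provide.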








If $\Phi_*' = g_*$, then the optimal solution to the above problem corresponds to the single index model that satisfies $\mathbb{E}[Y \mid X = x] = g_*(w_*^\top x)$. Hence the above calibrated loss function takes into account the transfer function $g_*$ used in the SIM via $\Phi_*$ and automatically adapts to the SIM from which the data is generated. When $g_*$ is unknown, we instead consider the following optimization problem:


$$\hat w, \hat g = \arg\min_{w, g} \frac{1}{n} \sum_{i=1}^{n} \left[ \Phi(w^\top x_i) - y_i w^\top x_i \right] + \lambda \|w\|_1 \quad \text{s.t.} \quad g = \Phi' \in \mathcal{G} \quad (9)$$

where the set $\mathcal{G} = \{g : \mathbb{R} \to \mathbb{R} \text{ is a 1-Lipschitz, monotonic function}\}$. Note that the above optimization problem optimizes for $g$ via its integral $\Phi$. ciSILO solves the above optimization problem by iteratively minimizing over $w$ and $g$. The pseudo-code for ciSILO is given in Algorithm 2. There are three key update procedures performed in each iteration of ciSILO, which we explain below:

In Step 4, ciSILO fixes $g$ to $g_{t-1}$ and performs one step of a proximal point update on the objective in problem (9) w.r.t. $w$ to get:

$$w_t = \mathrm{Prox}_{\lambda\eta\|\cdot\|_1}\left( w_{t-1} - \frac{\eta}{n} \sum_{i=1}^{n} (g_{t-1}(w_{t-1}^\top x_i) - y_i)\, x_i \right). \quad (10)$$

This step is identical to the update step in iSILO except that $g'_{t-1}$ does not feature in this update. Thus, the proximal point steps using a calibrated loss function can be performed exactly, unlike the proximal point steps in iSILO.

The use of a calibrated loss function brings with it another challenge: The LPAV procedure, which was designed to minimize the squared loss, can no longer be used in ciSILO to estimate $g_*$. ciSILO instead uses a novel quadratic program to efficiently estimate $g_*$. From the first order optimality conditions of the optimization problem (9) for $w$ at $w_t$ we get that the optimal function $g_t$ should satisfy


$$\frac{1}{n} \sum_{i=1}^{n} (g_t(w_t^\top x_i) - y_i)\, x_i + \lambda \beta_t = 0, \quad \beta_t \in \partial \|w_t\|_1. \quad (11)$$

$g_t$ is updated such that the L.H.S. of (11) has the smallest possible norm. This can be cast as a quadratic program (QP) as follows: Define $p = [p_1, \ldots, p_n]^\top$, where $p_i = w_t^\top x_i$, and $z = [z_1, \ldots, z_n]^\top$, where $z_i = g_t(p_i)$. Let $X = [x_1, \ldots, x_n]^\top$ be the $n \times d$ data matrix. Let $q = n\lambda\beta_t - X^\top y$. Now, solve the problem

$$\min_{z} \|X^\top z + q\|_2^2 \quad \text{s.t.} \quad 0 \le z_i \le 1 \ \forall i \ \text{ and } \ 0 \le z_j - z_i \le p_j - p_i \ \text{ if } p_i \le p_j \quad (12)$$

We call optimization problem (12) QPFit, which is different from the LPAV given that it is derived from optimizing a calibrated loss function, which could be very different from the squared loss.


2.4 Initializing iSILO and ciSILO 

Since both iSILO and ciSILO are non-convex, alternating minimization procedures, a good initialization is key to achieving good performance. A simple initialization would be to choose $w_0$ randomly and $g_0$ to be the identity function. However, we initialize both methods with $\hat w$, $\hat g$ obtained by running the (efficient) SILO algorithm from Section 2.1. We demonstrate in the next section that this yields very good theoretical guarantees, as well as good empirical performance in Section 4.

Remarks: As in iSILO, we perform book-keeping steps in ciSILO too. Since obtaining exact or approximate gradients in iSILO and ciSILO is easy, we use first order methods to solve for $w$. Using line search methods in ciSILO to compute step sizes would require evaluating the calibrated loss function. This can be computationally intensive, since we have access to the calibrated loss function only via its gradient. Hence, in iSILO and ciSILO we use a fixed step size to perform our updates. Despite the use of a fixed step size, we show empirically that iSILO is often as competitive and sometimes better at making predictions than GLM based methods with optimal step sizes, and ciSILO is significantly superior.
Algorithm 2: ciSILO

Require: Data: $X = [x_1, \ldots, x_n]$, Labels: $y = [y_1, \ldots, y_n]^\top$, Regularization parameter: $\lambda$, Step size: $\eta$, Initial parameters: $w_0 \in \mathbb{R}^d$, $g_0 : \mathbb{R} \to \mathbb{R}$ is a 1-Lipschitz, monotonic function, Iterations: $T > 0$.

1: Initialize $\hat w = w_0$, $\hat g = g_0$.
2: opterr = MSE($w_0$, $g_0$)
3: for $t = 1, 2, \ldots, T$ do
4:    Perform the update step shown in Equation (10) to obtain $w_t$.
5:    Calculate err = MSE($w_t$, $g_{t-1}$).
6:    if err < opterr then
7:       opterr = err.
8:       $\hat w = w_t$, $\hat g = g_{t-1}$
9:    end if
10:   Calculate: $p \leftarrow X w_t$, $\beta \in \partial \|w_t\|_1$, $q \leftarrow n\lambda\beta - X^\top y$
11:   Obtain $g_t$ by solving problem (12) and linear interpolation.
12:   Calculate err = MSE($w_t$, $g_t$).
13:   if err < opterr then
14:      opterr = err.
15:      $\hat w = w_t$, $\hat g = g_t$
16:   end if
17: end for
18: Output $\hat w$, $\hat g$


3 Theoretical analysis of SILO, iSILO and ciSILO 

In this section, we analyze the excess risk of the predictors output by iSILO and ciSILO. For a given hypothesis $h(x) = g(w^\top x)$, define $\mathrm{err}(h) := \mathbb{E}(h(x) - y)^2$. The excess risk is then defined as

$$\mathcal{E}(h) := \mathrm{err}(h) - \mathrm{err}(h_*) = \mathbb{E}(y - h(x))^2 - \mathbb{E}(y - g_*(w_*^\top x))^2 \quad (13)$$
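A short derivation, assuming the model $\mathbb{E}[Y \mid X = x] = g_*(w_*^\top x)$, writing $h_*(x) = g_*(w_*^\top x)$, and treating $h$ as fixed with respect to the fresh sample $(x, y)$, shows why the excess risk is a squared distance between predictors. This is the standard conditional-expectation argument, spelled out here for convenience:

$$\begin{aligned} \mathbb{E}(y - h(x))^2 &= \mathbb{E}(y - h_*(x))^2 + \mathbb{E}(h_*(x) - h(x))^2 + 2\,\mathbb{E}\big[(y - h_*(x))(h_*(x) - h(x))\big], \\ \mathbb{E}\big[(y - h_*(x))(h_*(x) - h(x))\big] &= \mathbb{E}\big[(h_*(x) - h(x))\,\mathbb{E}[\,y - h_*(x) \mid x\,]\big] = 0, \end{aligned}$$

so that $\mathcal{E}(h) = \mathbb{E}(h(x) - h_*(x))^2$.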

We first list the technical assumptions we make: 

A1. The data $x_1, \ldots, x_n$ is sampled i.i.d. from the standard multivariate Gaussian distribution.

A2. $\mathbb{E}[Y \mid X = x] = g_*(w_*^\top x)$, and $0 \le Y \le 1$.

A3. $g_*$ is monotonic and 1-Lipschitz.

A4. $\|w_*\|_0 \le s$, $\|w_*\|_2 \le 1$, $\|\hat w\|_0 \le k$, and $k \le d$.

We provide sketches of relevant results in this section, and refer the interested reader to the Appendix for detailed proofs. Our first main result is an excess risk bound for SILO:

Theorem 1. Let $\hat h(x) = \hat g(\hat w^\top x)$ be the hypothesis output by SILO. Let $\theta = \mathbb{E}_{\mu \sim N(0,1)}\, g_*(\mu)\mu > 0$. Then under assumptions A1-A4, the excess risk of the predictor $\hat h$ is, with probability at least $1 - \delta$, bounded from above by

where O hides factors that are poly-logarithmic in n,d, y, s and k. 
Proof Sketch: For notational convenience, denote by $\epsilon^2 = \frac{1}{\theta} \sqrt{\frac{C s \log(2d/s)}{n}}$, where $C > 0$ is a universal constant. WLOG, we can assume that $\|\hat w\|_0 \le s$. Our assumption on the sparsity of $\hat w$ is pretty lenient, and is most often satisfied in practice. Also, since $\hat w$ is obtained from SILO, we have $\|\hat w\|_2 \le 1$, $\|\hat w\|_1 \le \sqrt{s}$. From a result of Plan and Vershynin [10, Corollary 3.1] (Lemma 4 in appendix), we know that $\|\hat w - w_*\|_2^2 \le \epsilon^2$. The excess risk $\mathcal{E}(\hat h)$ can be bounded as follows.

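$$\begin{aligned} \mathcal{E}(\hat h) &= \mathbb{E}\big[(\hat g(\hat w^\top x) - y)^2 - (g_*(w_*^\top x) - y)^2\big] = \mathbb{E}(\hat g(\hat w^\top x) - g_*(w_*^\top x))^2 \\ &= \mathbb{E}(\hat g(\hat w^\top x) - \hat g(w_*^\top x) + \hat g(w_*^\top x) - g_*(w_*^\top x))^2 \\ &\le 2(s + k)\epsilon^2 \log(2d) + 2\,\mathbb{E}(\hat g(w_*^\top x) - g_*(w_*^\top x))^2 \quad \text{with probability at least } 1 - \delta \end{aligned}$$

where we used the fact that $\hat g$ is 1-Lipschitz, and upper bounds on the expected suprema of a collection of Gaussian random variables. Next, we shall bound the R.H.S. of the above equation.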


K (gi w J x ) - g+( w J x )) 2 < x ) - y ) 2 - ^(g*( w J x ) - vf 


^ 1 n / 

< - y2(g(wj Xi) - yt) 2 - (g*(wj Xi) - y t ) 2 + 6 ( 

n 1 V 

1=1 


; s io g (2d)) 1 /n 

J 


In inequality (a) we used a certain projection inequality for convex sets (see Lemma 1 in appendix). To obtain inequality (b) we replace the expected value quantities with their empirical versions, plus deviation terms. Via standard
application of large deviation inequalities, it is possible to establish that these deviations are 0( ( slog (^^ —j (see 
Lemma 5 in appendix). The proof concludes by upper bounding the empirical term in the above equation using 
optimality of g and properties of maxima of a collection of Gaussian random variables. 

Our next result is an upper bound on the excess risk of iSILO and ciSILO:

Theorem 2. Suppose $\hat g$, $\hat w$ are the outputs of SILO on our data. Let $\hat h(x) = \hat g(\hat w^\top x)$ be the hypothesis corresponding to these outputs. Let $h_*(x) = g_*(w_*^\top x)$. Now, let $\hat h_T$ be the output of ciSILO obtained by using $\hat g$, $\hat w$ as initializers. Then under the assumptions A1-A4, with high probability we can bound the excess risk of $\hat h_T$ by

r ( V, <o (<i + _L(1)‘ vT^+AErfMj) + 

where $\tilde O$ hides factors that are poly-logarithmic in $n$, $d$, $s$, $k$. Moreover, the same excess risk guarantees hold for $\hat h_T$ obtained by running iSILO.


Proof Sketch: From Theorem 1 we know that

£{h) = err (ft) - err (ft*) < 6 ^ ( s + fc )J°gM) ^/f + J_ 4 + fc) l 0 g(2d)^ 

Using standard large deviation arguments (see Lemma 6 in appendix) we can claim that | err(ft) — err (ft) = O(^f^) 
with probability at least 1 — 5. This gives us 


err (ft) = err (ft) + O 


= err (ft*) + err (ft) — err (ft*) + O 


n \ A / (s + k) log( 2 d) fs 1 /s\i n -p-j— 77tk\ slog 2 (2d + 1 ) 

= err(/i.) + O f + Te (~) Vo + k) log<22) ) + ^ - L 


Now consider $\hat h_T$ obtained by running either ciSILO or iSILO for $T$ iterations, when initialized with $\hat w$, $\hat g$ obtained by running SILO first on the data. Since $\hat h_T$ is chosen by using a held-out validation set as the iterate corresponding
Figure 1: Error rates are normalized so that the Slisotron has an error of 1. Note that ciSILO consistently outperforms all other methods, and iSILO is very competitive. The numbers below each dataset refer to $(n, d)$.


to the smallest validation error, we can claim via Hoeffding's inequality that the empirical error of $\hat h_T$ cannot be too much larger than that of $\hat h$ (for otherwise $\hat h_T$ will not be the iterate with the smallest validation error). Precisely, if the
validation set is of size n, then with high probability eir(/tT) < err (h) + O (^ 7 ) ■ Using the above inequalities, and 
via standard large deviation arguments to bound | err (hr) — eri^hr)] we get the desired result. 

Remarks : In the bound of Theorem |5J the first term in O dominates, and the excess risk bound is essentially 
() ^ (s+fc) iog( 2 rf) Also, using the output of SILO to initialize iSILO and ciSILO yields strong theoretical guar¬ 

antees. 

The constant $\theta$ in our results: $\theta$ acts like the signal to noise ratio in our results. The larger $\theta$ is, the better our bound gets. For example, for the logistic model, $\theta$ is approximately the norm of the data ($\approx \sqrt{\log(d)}$). For measurements of the form $y = \mathrm{sign}(x^\top w)$, $\theta$ is a constant. $\theta < 0$ can be easily tackled by reversing the signs of $y$, and $\theta = 0$ implies that the data and observations are uncorrelated, and naturally any error bound will be meaningless.
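For a given transfer function, the constant $\theta = \mathbb{E}_{\mu \sim N(0,1)}\, g(\mu)\mu$ can be approximated numerically. The Python sketch below uses Gauss-Hermite quadrature with the probabilists' weight $e^{-x^2/2}$. The function name `theta`, the quadrature order, and the example transfer functions are choices made here for illustration, and the quadrature is only reliable for smooth $g$.

```python
import numpy as np
from numpy.polynomial.hermite_e import hermegauss

def theta(g, order=60):
    """Approximate E[g(mu) * mu] for mu ~ N(0, 1) by Gauss-Hermite quadrature.

    g     : vectorized transfer function (assumed smooth)
    order : number of quadrature nodes
    """
    nodes, weights = hermegauss(order)       # weight function exp(-x^2 / 2)
    # Dividing by sqrt(2*pi) turns the weighted sum into a Gaussian expectation.
    return float(np.sum(weights * g(nodes) * nodes) / np.sqrt(2.0 * np.pi))

theta_linear = theta(lambda u: u)             # E[mu^2] for a linear transfer function
theta_tanh = theta(np.tanh)                   # an increasing, bounded example
theta_flipped = theta(lambda u: -np.tanh(u))  # reversing the sign flips theta
```

An increasing transfer function gives a positive value, and negating the transfer function negates $\theta$, in line with the sign-reversal remark above.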

Comparisons to existing results in low dimensions: In [5] the authors obtained dimension dependent as well as dimension independent bounds on the prediction error for the Slisotron algorithm for the SIM problem. However, these results were obtained under the restrictive assumption that $\|w_*\|_2 \le W$, $\|x\|_2 \le B$, and both $W$, $B$ are fixed and independent of dimensions.¹ In order to carry through a correct high-dimensional analysis, one needs to let either $W$ or $B$ or both grow with $d$. In our analysis, we assume that the data is sampled from a standard multivariate Gaussian, and hence $\|x\|_2 \le \sqrt{d}$ with high probability. If one were to replace $B$ with $\sqrt{d}$ in the results of [5], then the excess
risk of their predictor would scale as minl^l^-, and since d n, their bounds are meaningless in the high¬ 

dimensional setting. In contrast our results in Theorem [2] have a (poly)-logarithmic dependence on d, and hence are 
useful in the high dimensional setting studied in this paper. The same arguments apply to the results of [6], where in addition one needs a fresh batch of samples at each run.

4 Experimental results 

We tested our algorithms SILO, iSILO, and ciSILO on many real world high dimensional datasets. For comparison with methods that assume $g_*$ known, we used Sparse Logistic Regression (SLR) and Sparse Squared Hinge Loss minimization (SHL).² We also tested the Slisotron algorithm [5] designed for low-dimensional SIMs. For each dataset we randomly chose 60% of the data for training, and 20% each for validation and testing. The parameters $\lambda$, $\eta$ are chosen via validation. Mac-Win, Crypt-Elec, Atheism-Religion and Auto-Motorcycle are from the 20 Newsgroups

1 In their analysis B = 1. 

2 Code downloaded from http://www.cs.ubc.ca/~schmidtm/Software/L1General.html
dataset. Arcene is from the NIPS challenge, and the Page dataset is obtained from the WebKB dataset. Prostate and Colon cancer datasets are available online.

Figure 1 shows the misclassification error obtained on the test set. We show results for 8 datasets of varying size. Additional results are available in the supplementary material. Since the datasets (and errors) are varied, we normalize the error rates so that the Slisotron has unit error. As we can see from these results, using the calibrated loss in ciSILO yields the best performance in all the datasets considered, except Mac-Win. iSILO is as good as or better than SLR in 6/8 cases. It is encouraging to note that iSILO and ciSILO do well despite not having the luxury of choosing optimal step sizes at each iteration. Finally, the relatively poor performance of SILO underlines the importance of iterative methods in the SIM learning setting.


5 Conclusions 

In this paper, we introduced a suite of algorithms based on sparse parameter estimation for learning single index 
models in the high dimensional setting. We derived excess risk guarantees for the proposed methods. Our algorithm 
employing a calibrated loss and a novel quadratic programming method to fit the transfer function achieves superior 
results compared to standard high dimensional classification methods based on minimizing the logistic or the hinge 
loss. In the future we plan to investigate learning single index models with structural constraints other than sparsity 
such as low rank, group sparsity, and indeed other very general constraints. 
A Preliminaries


We shall need a few definitions and a few important lemmas and propositions before we can state the proofs of our 
theorems. We shall consider the following function class. 

$$\mathcal{G} = \{g : [-W, W] \to [0, 1],\ g \text{ is 1-Lipschitz and monotonic}\}. \quad (15)$$

Though the above definition of $\mathcal{G}$ uses an unspecified parameter $W$, most often we shall use $W = \sqrt{s \log(2d)}$. The following result concerning suprema of a collection of i.i.d. Gaussian random variables is standard and we shall state it without proof.

Proposition 1. Let $\{g_i\}_{i=1}^{m}$ be a collection of $m$ i.i.d. Gaussian random variables with mean 0 and variance $\sigma^2$. Then,

$$\max_{i \in [m]} |g_i| \le \sigma \left( \sqrt{\log(2m)} + \sqrt{2 \log(2/\delta)} \right) \quad \text{w.p. } \ge 1 - \delta$$

The next lemma is standard and a proof can be found in Lemma 9 in [?].

Lemma 1. Let $\mathcal{F}$ be a convex class of functions, and let $f^* = \arg\min_{f \in \mathcal{F}} \mathbb{E}(f(x) - y)^2$. Suppose that $\mathbb{E}[Y \mid X = x] = g_*(w_*^\top x)$ for some $g_* \in \mathcal{G}$. Then for any $f \in \mathcal{F}$, the following holds true

$$\mathbb{E}[(f(x) - y)^2] - \mathbb{E}[(f^*(x) - y)^2] \ge \mathbb{E}[(f(x) - f^*(x))^2] \quad (16)$$

Lemma 2. Let $x \in \mathbb{R}^d$ be a standard normal random vector. Then with probability at least $1 - \delta$

$$w_*^\top x \le O(\sqrt{s \log(2d)})$$

Proof. The proof follows immediately from Proposition 1 and the fact that $\|w_*\|_1 \le \sqrt{s}$. □

Lemma 3. Let $e \in \mathbb{R}^d$ be such that $\|e\|_0 \le s + k$ and $\|e\|_2 \le \epsilon$. Let $x$ be a standard normal random vector. Then with probability at least $1 - \delta$

$$e^\top x \le O(\epsilon \sqrt{(s + k) \log(2d)})$$

Proof. Let $e = [e_1, \ldots, e_d]$. Similarly, let $x = [x_1, \ldots, x_d]$. We then have

$$\begin{aligned} e^\top x &= \sum_{i=1}^{d} e_i x_i && (17) \\ &\le \max_j |x_j| \sum_{i=1}^{d} |e_i| && (18) \\ &\overset{(a)}{\le} \sqrt{\log(2d)} \sum_{i=1}^{s+k} |e_i| \quad \text{w.p. } 1 - \delta && (19) \\ &\overset{(b)}{\le} \epsilon \sqrt{(s + k) \log(2d)}. && (20) \end{aligned}$$

In obtaining inequality (a) we used the fact that the max of the absolute value of $d$ Gaussian random variables is bounded by $\sqrt{\log(2d)}$. In inequality (b) we used the fact that $\|e\|_0 \le s + k$, and hence only $s + k$ of the elements of $e$ are non-zero. □


We next need the following important result (Corollary 3.1 in [10]).

Lemma 4. Let $\mathcal{W} = \{w \in \mathbb{R}^d : \|w\|_2 \le 1, \|w\|_1 \le \sqrt{s}\}$. Let $\hat w$ be obtained from SILO, shown in the main paper. Suppose $w_* \in \mathcal{W}$. Let $x_1, \ldots, x_n$ be $n$ independent Gaussian random vectors. Assume that the measurements satisfy $\mathbb{E}[Y \mid X = x] = g_*(w_*^\top x)$, where $\|w_*\|_2 \le 1$, $\|w_*\|_0 \le s$. Then with probability at least $1 - \delta$, the solution $\hat w$ obtained from SILO satisfies the inequality

$$\|\hat w - w_*\|_2^2 \le \epsilon^2 \le \frac{1}{\theta} \sqrt{\frac{C s \log(2d/s)}{n}}$$

where $C > 0$ is a universal constant, and $\theta = \mathbb{E}_{\mu \sim N(0,1)}\, g_*(\mu)\mu$.
Lemma 5. With probability at least $1 - \delta$


Jx) - y) 2 - E (g*(wjx) - y) 2 < i Y^{g{wJxf - yi ) 2 - {g*(wjxf - yf 2 + 6 ( 21 ) 


where O hides factors that are (poly)-logarithmic in n, | 
Proof. From Lemma 6 (i) in a we know that 


fT 2 (r,G,z 1 ,... ,z n ) <^^,0) < -2 2 ^. 

r 


( 22 ) 


where ^(r, Q. z t ..... z „) is the L 2 empirical covering number of function class Q at radius r, and N^ir, Q) is the 
Loo covering number. Using Dudley entropy integral, we can upper bound the empirical Rademacher complexity by 


R n (G) = inf 4a + 10 

a>0 


rl / log(l/r) + ^ ^ < 40 VW 


/n 


Hence, via standard large deviation inequalities we can claim that 

E[(g07 a;) - yf] < - ^ - y ) 2 + 

n z — J V n 

Similarly via standard concentration inequalities we can claim that with probability at least 1 — 5, 

|E[( 5 *OJa:) - y) 2 } - ^T{g*(wlx) - y) 2 \ < 

i 

and hence putting together the above two inequalities the desired result follows. 


(23) 

(24) 

(25) 

□ 


B Proof of Theorem 1

For notational convenience, denote by $\epsilon^2 = \frac{1}{\theta} \sqrt{\frac{C s \log(2d/s)}{n}}$, where $C > 0$ is a universal constant. Since $\hat w$ is obtained from SILO, we have $\|\hat w\|_2 \le 1$, $\|\hat w\|_1 \le \sqrt{s}$. The excess risk $\mathcal{E}(\hat h)$ can be bounded as follows.

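$$\begin{aligned} \mathcal{E}(\hat h) &= \mathbb{E}\big[(\hat g(\hat w^\top x) - y)^2 - (g_*(w_*^\top x) - y)^2\big] \\ &= \mathbb{E}(\hat g(\hat w^\top x) - g_*(w_*^\top x))^2 \\ &= \mathbb{E}(\hat g(\hat w^\top x) - \hat g(w_*^\top x) + \hat g(w_*^\top x) - g_*(w_*^\top x))^2 \\ &\le 2\,\mathbb{E}(\hat g(\hat w^\top x) - \hat g(w_*^\top x))^2 + 2\,\mathbb{E}(\hat g(w_*^\top x) - g_*(w_*^\top x))^2 \\ &\overset{(a)}{\le} 2\,\mathbb{E}((\hat w - w_*)^\top x)^2 + 2\,\mathbb{E}(\hat g(w_*^\top x) - g_*(w_*^\top x))^2 \\ &\overset{(b)}{\le} 4 s \epsilon^2 \log(2d) + 2\,\mathbb{E}(\hat g(w_*^\top x) - g_*(w_*^\top x))^2 \quad \text{with probability at least } 1 - \delta \quad (26) \end{aligned}$$

where in order to obtain inequality (a) we used the fact that $\hat g$ is 1-Lipschitz, and in order to obtain inequality (b) we used Lemma 3. We shall now bound the R.H.S. of inequality (26). We do this as follows

$$\begin{aligned} \mathbb{E}(\hat g(w_*^\top x) - g_*(w_*^\top x))^2 &\overset{(a)}{\le} \mathbb{E}(\hat g(w_*^\top x) - y)^2 - \mathbb{E}(g_*(w_*^\top x) - y)^2 && (27) \\ &\overset{(b)}{\le} \frac{1}{n} \sum_{i=1}^{n} \left[ (\hat g(w_*^\top x_i) - y_i)^2 - (g_*(w_*^\top x_i) - y_i)^2 \right] + \Delta_1 && (28) \end{aligned}$$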
In inequality (a) we used Lemma 1 with the function class $\mathcal{F} = \mathcal{G} \circ w_*$. In inequality (b) we used Lemma 5 to bound the
expectation quantity in terms of its empirical quantity, with Ai set to the maximum value of wj x t . We know, from 
Lemma [ 2 ] that this max value is \Js log(2 d) with probability at least 1 — 5. Hence by substituting W = \Js log(2d) 

for W, we get Ai = o(\J ^ slo ^ 2d) 


We have 


. Next we shall try to upper bound the empirical term in the above equation. 


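$$\begin{aligned} &\frac{1}{n} \sum_{i=1}^{n} (\hat g(w_*^\top x_i) - y_i)^2 - \frac{1}{n} \sum_{i=1}^{n} (g_*(w_*^\top x_i) - y_i)^2 \\ &\quad = \frac{1}{n} \sum_{i=1}^{n} (\hat g(\hat w^\top x_i) - y_i - \hat g(\hat w^\top x_i) + \hat g(w_*^\top x_i))^2 - \frac{1}{n} \sum_{i=1}^{n} (g_*(\hat w^\top x_i) - y_i - g_*(\hat w^\top x_i) + g_*(w_*^\top x_i))^2 \\ &\quad = \underbrace{\frac{1}{n} \sum_{i=1}^{n} (\hat g(\hat w^\top x_i) - y_i)^2 - \frac{1}{n} \sum_{i=1}^{n} (g_*(\hat w^\top x_i) - y_i)^2}_{\le 0} \\ &\qquad + \underbrace{\frac{1}{n} \sum_{i=1}^{n} (\hat g(\hat w^\top x_i) - \hat g(w_*^\top x_i))^2}_{T_1} - \underbrace{\frac{1}{n} \sum_{i=1}^{n} (g_*(w_*^\top x_i) - g_*(\hat w^\top x_i))^2}_{\ge 0} \\ &\qquad - \underbrace{\frac{2}{n} \sum_{i=1}^{n} (\hat g(\hat w^\top x_i) - y_i)(\hat g(\hat w^\top x_i) - \hat g(w_*^\top x_i))}_{T_2} - \underbrace{\frac{2}{n} \sum_{i=1}^{n} (g_*(\hat w^\top x_i) - y_i)(g_*(w_*^\top x_i) - g_*(\hat w^\top x_i))}_{T_3} \quad (29) \end{aligned}$$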




where the term marked as $\le 0$ is negative because $\hat g$ is the solution to a minimization problem that minimizes the empirical squared error under monotonicity and 1-Lipschitz constraints. Since $g_*$ is also monotonic and 1-Lipschitz, the squared error corresponding to the predictor $\hat g(\hat w^\top x)$ should be smaller than the squared error corresponding to $g_*(\hat w^\top x)$. The term marked as $\ge 0$ is positive because it is an average of squared quantities. We shall now bound $T_1$, $T_2$, $T_3$ as follows


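\begin{align}
T_1 &= \frac{1}{n}\sum_{i=1}^{n}\big(\hat{g}(\hat{w}^T x_i) - \hat{g}(w_*^T x_i)\big)^2 \tag{30}\\
&\overset{(a)}{\le} \frac{1}{n}\sum_{i=1}^{n}\big((\hat{w} - w_*)^T x_i\big)^2 \tag{31}\\
&\overset{(b)}{\le} (s + k)\,\epsilon^2 \log(2d) \tag{32}
\end{align}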

where inequality (a) uses the fact that $\hat{g}$ is 1-Lipschitz and inequality (b) uses Lemma 2. To upper bound $T_2$ we proceed as follows.

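\begin{align}
T_2 &= \frac{2}{n}\sum_{i=1}^{n}\big(\hat{g}(\hat{w}^T x_i) - y_i\big)\big(\hat{g}(\hat{w}^T x_i) - \hat{g}(w_*^T x_i)\big) \tag{33}\\
&\overset{(a)}{\le} \frac{2}{n}\sum_{i=1}^{n}\big|\hat{g}(\hat{w}^T x_i) - \hat{g}(w_*^T x_i)\big| \tag{34}\\
&\overset{(b)}{\le} 2\epsilon\sqrt{(s + k)\log(2d)} \tag{35}
\end{align}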
To obtain inequality (a) we used the fact that $|y_i - \hat{g}(\hat{w}^T x_i)| \le 1$, and to obtain inequality (b) we used the fact that $\hat{g}$ is 1-Lipschitz together with Lemma 3. Both steps bound absolute values, so the same bound holds for $|T_2|$. The same reasoning can be applied to $T_3$ to get a bound of the same order, $|T_3| = O\big(\epsilon\sqrt{(s + k)\log(2d)}\big)$.
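As a sanity check on the algebra, the expansion behind (29) and the Lipschitz bounds on $T_1$ and $T_2$ can be verified numerically. The sketch below is illustrative only: the sigmoid and clipped-linear functions standing in for $\hat{g}$ and $g_*$, the Gaussian features, and all variable names are assumptions made for this example, not the paper's setup. It uses the convention that $T_2$ and $T_3$ denote the cross-term sums themselves, so both enter the expansion with a minus sign.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 200, 50

# Hypothetical stand-ins: both are monotone, 1-Lipschitz, with values in [0, 1].
def g_hat(z):
    return 1.0 / (1.0 + np.exp(-z))      # sigmoid, plays the role of \hat{g}

def g_star(z):
    return np.clip(z, 0.0, 1.0)          # clipped linear, plays the role of g_*

X = rng.standard_normal((n, d))
w_star = np.zeros(d)
w_star[:3] = 1.0 / np.sqrt(3.0)                  # sparse "true" weights
w_hat = w_star + 0.05 * rng.standard_normal(d)   # perturbed estimate
y = rng.uniform(0.0, 1.0, size=n)                # responses in [0, 1]

a, b = g_hat(X @ w_star), g_hat(X @ w_hat)       # \hat{g} at w_* and at \hat{w}
c, e = g_star(X @ w_star), g_star(X @ w_hat)     # g_* at w_* and at \hat{w}
diff = X @ (w_hat - w_star)                      # (\hat{w} - w_*)^T x_i

# Left-hand side: difference of empirical squared errors evaluated at w_*.
lhs = np.mean((a - y) ** 2) - np.mean((c - y) ** 2)

# Right-hand side: the five pieces of the expansion.
# Note: `first` is only guaranteed nonpositive when g_hat is the constrained
# least-squares minimizer, which this stand-in is not.
first = np.mean((b - y) ** 2) - np.mean((e - y) ** 2)
T1 = np.mean((b - a) ** 2)
sq = np.mean((c - e) ** 2)
T2 = 2.0 * np.mean((b - y) * (b - a))
T3 = 2.0 * np.mean((e - y) * (c - e))
rhs = first + T1 - sq - T2 - T3

print(np.isclose(lhs, rhs))                       # the expansion is an identity
print(T1 <= np.mean(diff ** 2))                   # 1-Lipschitz step for T1
print(abs(T2) <= 2.0 * np.mean(np.abs(diff)))     # |y - g_hat| <= 1 and 1-Lipschitz
```

The identity holds for any choice of functions; only the two inequality checks rely on the Lipschitz and boundedness assumptions.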

Finally, using Lemma [?], we know that $\|w_* - \hat{w}\|_2^2 = \epsilon^2 \le [\ldots]$. Gathering all the terms, we get the claimed bound with probability at least $1 - \delta$,

where $\theta = \mathbb{E}_{\mathcal{N}(0,1)}[\ldots]$ is a constant that depends on $g_*$.

C Large Deviation Guarantees for iSILO, ciSILO

Lemma 6. For any hypothesis $h(x) = g(w^T x)$ with $g \in \mathcal{G}$ and $w \in \mathcal{W}$, where $\mathcal{W} = \{w \in \mathbb{R}^d : \|w\|_1 \le \sqrt{s},\ \|w\|_2 \le 1\}$, we have

$$\mathrm{err}(h) \le \widehat{\mathrm{err}}(h) + \tilde{O}\big([\ldots]\big),$$

where the $\tilde{O}$ hides factors (poly)logarithmic in $d$, $n$, $1/\delta$. In particular, the above result also applies to $h_T$, the hypothesis obtained by running iSILO or ciSILO for $T$ iterations, and to $\hat{h}$, the hypothesis obtained by running SILO.

Before we give the proof of this lemma, we point out that our assumption that $w \in \mathcal{W}$ is not at all restrictive. In practice, the iterates of the proximal gradient method used in SILO-M are sparse for a sufficiently large $\lambda$.
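To illustrate why proximal gradient iterates become sparse as the regularization weight grows, the sketch below applies one proximal step with soft-thresholding. It assumes an $\ell_1$ regularizer, which is the standard setting in which proximal steps produce exact zeros; the function names, step size, and random inputs are hypothetical and are not taken from SILO-M itself.

```python
import numpy as np

def soft_threshold(v, tau):
    """Proximal operator of tau * ||.||_1: shrink each entry toward zero by tau."""
    return np.sign(v) * np.maximum(np.abs(v) - tau, 0.0)

def prox_grad_step(w, grad, step, lam):
    """One proximal gradient step for an l1-regularized objective (assumed form)."""
    return soft_threshold(w - step * grad, step * lam)

rng = np.random.default_rng(0)
w = rng.standard_normal(1000)       # hypothetical current iterate
grad = rng.standard_normal(1000)    # hypothetical gradient of the smooth loss

counts = []
for lam in (0.1, 1.0, 5.0):
    w_next = prox_grad_step(w, grad, step=0.5, lam=lam)
    counts.append(int(np.count_nonzero(w_next)))
    print(lam, counts[-1])          # number of nonzero weights after the step
```

Every coordinate whose gradient-step value has magnitude at most `step * lam` is set exactly to zero, so the nonzero count can only shrink as `lam` increases.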

Proof. Consider the function class $\mathcal{H} = \{h(x) = g(w^T x) : w \in \mathcal{W},\ g \in \mathcal{G}\}$. By construction, we are guaranteed that $h_T, \hat{h} \in \mathcal{H}$ w.h.p., with $W = \sqrt{s \log(2d)}$. To establish a large deviation bound on the risk of $h_T$, we first calculate the worst case Rademacher complexity of $\mathcal{H}$. To do this, we establish the $L_2$ covering number of the function class $\mathcal{H}$ from the $L_\infty$ covering number of $\mathcal{G}$ and the $L_2$ covering number of $\mathcal{W}$. Both these results are standard. From Lemma 6 in [?] we have


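$$\log \mathcal{N}_\infty(\mathcal{G}, \epsilon) \le [\ldots]. \tag{37}$$

Since $\|w\|_1 \le \sqrt{s}$ and $\|x\|_\infty \le O(\sqrt{\log(2d)})$, we can use Theorem 3 in [16] to conclude that w.h.p.

$$\log \mathcal{N}_2(\mathcal{W}, \epsilon, n) \le \frac{s \log^2(2d + 1)}{\epsilon^2}. \tag{38}$$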


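It is not hard to see that

\begin{align}
\log \mathcal{N}_2(\mathcal{H}, \epsilon, n) &\le \log \mathcal{N}_2\Big(\mathcal{W}, \frac{\epsilon}{2\sqrt{2}}, n\Big) + \log \mathcal{N}_\infty\Big(\mathcal{G}, \frac{\epsilon}{2\sqrt{2}}\Big) \tag{39}\\
&= O\Big(\frac{s \log^2(2d + 1)}{\epsilon^2}\Big). \tag{40}
\end{align}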


Using Lemma A.1 in [?] we can bound the worst case Rademacher complexity of $\mathcal{H}$ by

$$\mathcal{R}_n(\mathcal{H}) \le \tilde{O}\Big(\sqrt{\frac{s \log^2(2d + 1)}{n}}\Big).$$
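For intuition about the $\sqrt{s \log(d)/n}$ scaling, the empirical Rademacher complexity of the linear part of the class can be estimated by Monte Carlo. The sketch below is a simplification under stated assumptions: it covers only linear functions over the $\ell_1$ ball of radius $\sqrt{s}$ (a superset of $\mathcal{W}$), not the composite class $\mathcal{H}$, and it compares against Massart's finite-class lemma rather than the covering-number argument used in the proof. The Gaussian design and all names are hypothetical.

```python
import numpy as np

def empirical_rademacher_l1(X, radius, n_draws=2000, seed=0):
    """Monte Carlo estimate of E_sigma sup_{||w||_1 <= radius} (1/n) sum_i sigma_i w^T x_i.

    The supremum over the l1 ball is attained at a signed coordinate vector,
    so it equals radius * ||X^T sigma||_inf / n.
    """
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    sigma = rng.choice([-1.0, 1.0], size=(n_draws, n))   # Rademacher signs
    sup_vals = radius * np.max(np.abs(sigma @ X), axis=1) / n
    return float(sup_vals.mean())

def massart_bound(X, radius):
    """Massart's lemma applied to the 2d vectors {+X[:, j], -X[:, j]}."""
    n, d = X.shape
    max_col_norm = np.max(np.linalg.norm(X, axis=0))
    return float(radius * max_col_norm * np.sqrt(2.0 * np.log(2 * d)) / n)

rng = np.random.default_rng(1)
n, d, s = 100, 500, 5
X = rng.standard_normal((n, d))      # hypothetical design matrix
est = empirical_rademacher_l1(X, np.sqrt(s))
bound = massart_bound(X, np.sqrt(s))
print(est, bound)                    # the estimate should sit below the bound
```

With unit-variance features the column norms are about $\sqrt{n}$, so the bound behaves like $\sqrt{2 s \log(2d)/n}$, matching the dependence on $s$, $d$, and $n$ up to logarithmic factors.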
Finally, applying Theorem 1 in [14], we get with probability at least $1 - \delta$

$$\mathrm{err}(h_T) \le \widehat{\mathrm{err}}(h_T) + \tilde{O}\Big(\sqrt{\widehat{\mathrm{err}}(h_T)}\,\sqrt{\frac{s \log^2(2d + 1)}{n}}\Big).$$

□


D Proof of Theorem 2

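Proof. From Theorem [?] we know that

$$\mathcal{E}(\hat{h}) = \mathrm{err}(\hat{h}) - \mathrm{err}(h_*) \le \tilde{O}\Big([\ldots] + [\ldots]\sqrt{(s + k)\log(2d)}\Big).$$

Using Lemma 6 we can say that with probability at least $1 - \delta$

\begin{align}
\widehat{\mathrm{err}}(\hat{h}) &= \mathrm{err}(\hat{h}) + \tilde{O}\Big(\sqrt{\frac{s \log^2(2d + 1)}{n}}\Big) \tag{41}\\
&= \mathrm{err}(h_*) + \mathrm{err}(\hat{h}) - \mathrm{err}(h_*) + \tilde{O}\Big(\sqrt{\frac{s \log^2(2d + 1)}{n}}\Big) \tag{42}\\
&= \mathrm{err}(h_*) + \tilde{O}\big([\ldots]\big). \tag{43}
\end{align}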


Now consider $h_T$ obtained by running iSILO for $T$ iterations, initialized with the $\hat{w}$, $\hat{g}$ obtained by first running SILO on the data. Since $h_T$ is chosen using a held-out validation set as the iterate with the smallest validation error, we can claim via Hoeffding's inequality that the empirical error of $h_T$ cannot be much larger than that of $\hat{h}$ (otherwise $h_T$ would not be the iterate with the smallest validation error). Precisely, if the validation set is of size $n$, then with high probability

$$\widehat{\mathrm{err}}(h_T) \le \widehat{\mathrm{err}}(\hat{h}) + O\big([\ldots]\big). \tag{44}$$
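The selection step can be sketched as follows, assuming per-example validation losses lie in $[0, 1]$ so that Hoeffding's inequality applies directly. The helper names, the union bound over candidate iterates, and the simulated losses are assumptions made for illustration; the text does not state the exact constant in the deviation term.

```python
import math
import numpy as np

def hoeffding_radius(n_val, delta, n_candidates=1):
    """Two-sided Hoeffding deviation for means of [0, 1]-valued losses,
    holding simultaneously for n_candidates hypotheses (union bound)."""
    return math.sqrt(math.log(2.0 * n_candidates / delta) / (2.0 * n_val))

def select_iterate(val_losses):
    """Pick the iterate with the smallest mean validation loss.

    val_losses has shape (num_iterates, n_val); row 0 is taken to be the
    initial hypothesis (for example, the output of the first stage).
    """
    val_err = val_losses.mean(axis=1)
    best = int(np.argmin(val_err))
    return best, val_err

rng = np.random.default_rng(0)
val_losses = rng.uniform(0.0, 1.0, size=(6, 500))   # hypothetical losses
best, val_err = select_iterate(val_losses)
radius = hoeffding_radius(n_val=500, delta=0.05, n_candidates=6)

# By construction the selected iterate is never worse than the initial one
# on the validation set; `radius` controls how far each validation error
# can be from the corresponding expected error.
print(best, val_err[best] <= val_err[0], radius)
```

Because the selected iterate minimizes validation error over a set that includes the initial hypothesis, its validation error is at most that of the initial hypothesis, and the deviation term shrinks at the rate $1/\sqrt{n}$.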


Summing up Equations (41) and (42), we get

$$\widehat{\mathrm{err}}(h_T) \le \mathrm{err}(h_*) + \tilde{O}\Big([\ldots]\,(s + k)\log(2d) + [\ldots]\sqrt{(s + k)\log(2d)} + [\ldots]\, s \log^2(2d + 1)\Big). \tag{45}$$

Now using Theorem [?] to upper bound $\mathrm{err}(h_T)$ in terms of $\widehat{\mathrm{err}}(h_T)$, and combining it with the above bound, we get the desired result. The same arguments apply to the ciSILO algorithm. □


E Additional Experimental Results 

Here we report results on other high-dimensional datasets. Figure 2 again shows the advantage of the calibrated and iterative method ciSILO. Table 1 has the details of the datasets in Figure 2.
Figure 2: Comparison of different methods over different datasets. The results are normalized so that the Slisotron has error = 1.


Dataset      n      d
Leukemia     44     7129
Eyedata      120    200
Link         526    1840
Page+Link    526    4840
Gisette      4200   5000

Table 1: Dataset details